Los Angeles is a city of around 4 Mio1 inhabitants from diverse backgrounds (about half2 of the population is latino) and had a crime rate of about 6353 per 100’000 inhabitants per year in 2015. The city hosts the biggest port complex in the US and is an illegal drugs hub4 for the country. Finding a way to foster this diversity so as to take advantage of the city’s economical potential and reduce crime would provide more room for children to develop, achieve high education levels and freely roam around the city.
The aim of this paper is to identify how demographic data at the zip code level in Los Angeles impact crime rates in 2010. More precisely, the two following hypotheses are formulated and tested here:
For the purpose of the study, three data sets are used:
Crime Data from 2010 to Present5 A dataset reflecting incidents of crime and their latitude-Longitude coordinates in the City of Los Angeles dating back to 2010 (last update : August 9, 2018). The data is gathered by the LAPD and contains the following information: date and time of occurence and report of the crime, type of crime, demographic information about the victim and location of the crime.
2010 Census Populations by Zip Code6 A dataset coming from from the 2010 Census Profile of General Population and Housing Characteristics. It contains data by zip code that fall at least partially within LA city boundaries. It contains the following information: total population, median age, number of males and females, number of households and average household size.
Tiger Lines Zip codes7 A shapefile containing areas of zip codes in the US. A subset of the shapefile is coded in prep.R for the city of Los Angeles.
In the following, the data is explored taking into account the spatial nature of the data, i.e. bivariate spatial maps, centrographic statistics, spatial error/lag weighted OLS model and geographically weighted model. First, an exploratory data analysis is performed in order to assess the distribution of the variables of interest, followed by a non-spatial regression analysis.
## Loading and formatting data
p <- read.csv("../Research Data/2010_Census_Populations_by_Zip_Code.csv")
load("../Research Data/crime.RData")
load(file = "../Research Data/zipCrimes.RData")
names(zc)[2] <- names(p)[1] <- "zip"
zCrimes <- zc
zCrimes@data <- merge(zc@data, p, by = "zip")
# zCrimes <- zCrimes[!is.na(zCrimes$crimes),]
# writeOGR(obj=zCrimes, driver="ESRI Shapefile", "../Research Data/zCrimes")
The histograms below show that average household size and median age are centered around 3 and 40 respectively. Crime rate has a high number of 1, 2 and 3 values (the underlying data has no 0 values). This might be due to the lack of data in certain regions. Those 1s values were not dismissed in order to not dismiss zip code areas arbitrarily in the final analysis, instead the spatial plots use quantiles, which allow to categorize all low values together and allow for comparison with higher and more reasonable estimates of crime rates.
## histograms
par(mfrow=c(1,3))
hist(zCrimes$crimes, xlab = "crimes since 2010", main=NULL)
hist(zCrimes$Average.Household.Size, xlab = "average household size in 2010", main=NULL)
hist(zCrimes$Median.Age, xlab = "median age in 2010", main=NULL)
The interaction between the explanatory variables (median age and average household size) and the dependent variable (number of crimes per zip code) can be seen on the plot below. The natural log of number of crimes was used in order to standardize the data, since it is originally left skewed. It seems that that no particular relationship exists on a first look, this will be tested by the regressions further below.
## scatter plots
par(mfrow=c(1,2))
plot(zCrimes$Median.Age, log(zCrimes$crimes), xlab = "median age", ylab = "ln(crimes)")
plot(zCrimes$Average.Household.Size, log(zCrimes$crimes), , xlab = "average household size", ylab = "ln(crimes)")
The spatial data delivers more detailed information on certain underlying relationships. Under the hypothesis stated in the introduction, one would expect that the more a circle tends to red (i.e. the higher the average household size), the more an area tends to blue (the higher the number of crimes). This is however not the case, except for three areas north-west of Santa Monica.
## Mapping Number of Crimes VS Average Household Size
centroids <- as.data.frame(gCentroid(zCrimes,byid=TRUE))
qpal <- colorQuantile("YlGnBu", zCrimes$crimes, n = 5)
qHH <- colorQuantile("YlOrRd", zCrimes$Average.Household.Size, n = 5)
leaflet(zCrimes) %>% addPolygons(weight = 1, smoothFactor = 0.5,
opacity = 1.0, fillOpacity = 0.5,
color = ~qpal(crimes),
highlightOptions = highlightOptions(color = "white", weight = 2,
bringToFront = TRUE)) %>%
addCircles(lng = ~centroids$x, lat = ~centroids$y, weight = 1, color = ~qHH(Average.Household.Size),
radius = 800, popup = ~zCrimes$zip, opacity = 0.9, fillOpacity = 0.8 ) %>%
addTiles() %>%
addLegend(pal = qpal, values = ~crimes, opacity = 1) %>%
addLegend(pal = qHH, values = ~Average.Household.Size, opacity = 1)
Similarily, for median age against crime rates no particular relationship can be read from the plot.
## Mapping Number of Crimes VS Median Age
qMA <- colorQuantile("YlOrRd", zCrimes$Median.Age, n = 5)
leaflet(zCrimes) %>% addPolygons(weight = 1, smoothFactor = 0.5,
opacity = 1.0, fillOpacity = 0.5,
color = ~qpal(crimes),
highlightOptions = highlightOptions(color = "white", weight = 2,
bringToFront = TRUE)) %>%
addCircles(lng = ~centroids$x, lat = ~centroids$y, weight = 1, color = ~qMA(Median.Age),
radius = 800, popup = ~zCrimes$zip, opacity = 0.9, fillOpacity = 0.8 ) %>%
addTiles() %>%
addLegend(pal = qpal, values = ~crimes, opacity = 1) %>%
addLegend(pal = qMA, values = ~Median.Age, opacity = 1)
In the following is an attempt of exploring the relationships based on centroids and standard deviation ellipses (SDE). The independent variables are split into two groups. Above (red) VS below (orange) 40 for the median age, above (red) VS below (orange) 3 for the average household size.
## centrographic stats for number of crimes VS median age
zCrimes@data$id = rownames(zCrimes@data)
crimePoints = fortify(zCrimes, region="id")
crimesDf = join(crimePoints, zCrimes@data, by="id")
zCrimesLoc <- cbind(zCrimes@data, centroids)
ggplot(crimesDf) +
geom_polygon(aes(long,lat,group=group, fill = crimes)) +
coord_equal() + coord_fixed(1.3) +
stat_ellipse(data = subset(zCrimesLoc, Median.Age > 40), aes(x = x, y = y), level=0.5, color = "red") +
geom_point(data = subset(zCrimesLoc, Median.Age > 40), aes(x = mean(x), y = mean(y)), color = "red", size = 0.5) +
stat_ellipse(data = subset(zCrimesLoc, Median.Age <= 40), aes(x = x, y = y), level=0.5, color = "orange") +
geom_point(data = subset(zCrimesLoc, Median.Age <= 40), aes(x = mean(x), y = mean(y)), color = "orange", size = 0.5) +
theme_void()
number of crimes VS median age
The plot above shows that younger people spread in the north and in the south of the city mostly, whereas older people spread along a west-east axis. Given that the number of crimes is heterogeneous along any of those axis, a conclusion is again hard to draw on the effect of age on crime rates. The plot below, illustrates the relationship between crimes and household size.
## centrographic stats for number of crimes VS average household size
ggplot(crimesDf) +
geom_polygon(aes(long,lat,group=group, fill = crimes)) +
coord_equal() + coord_fixed(1.3) +
stat_ellipse(data = subset(zCrimesLoc, Average.Household.Size > 3), aes(x = x, y = y), level=0.5, color = "red") +
geom_point(data = subset(zCrimesLoc, Average.Household.Size > 3), aes(x = mean(x), y = mean(y)), color = "red", size = 0.5) +
stat_ellipse(data = subset(zCrimesLoc, Average.Household.Size <= 3), aes(x = x, y = y), level=0.5, color = "orange") +
geom_point(data = subset(zCrimesLoc, Average.Household.Size <= 3), aes(x = mean(x), y = mean(y)), color = "orange", size = 0.5) +
theme_void()
number of crimes VS average household size
Households, whether big or small, seem to be spread equally along the city, since both SDEs have similar shape. The following chapter deals with the spatial data with OLS methods.
The variables Average Household Size and Median Age will be considered in the model, given that they both relate to the hypotheses mentioned in the introduction. At first an ordinary a non-spatial OLS regression is performed, the assumptions for a BLUE estimator tested and its residuals mapped, before comparing these with the residuals of a lag or error term to control for spatial autocorrelation. Finally, a Geographically Weighted Regression Model is fitted on the same data in order to account for any inherent spatial autocorrelation.
Below is a regression analysis for the two independent variables (average household size and median age). It is deceiving in terms of explained variance and of the explanatory value of the independent variable median age. Average household size is significantly different from zero at a 5% confidence level though: a bigger household would imply more crimes as stated in the hypothesis in the introduction. Generally a high R-squared is a sign of possible multicollinearity. Although R-squared is particularily low here, one might still need to perform formal tests in order to assess the absence of multicollinearity.
regHH <- lm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes)
stargazer(regHH, no.space = T, type = "html")
| Dependent variable: | |
| log(crimes) | |
| Average.Household.Size | 0.675** |
| (0.305) | |
| Median.Age | 0.005 |
| (0.028) | |
| Constant | 3.398** |
| (1.441) | |
| Observations | 242 |
| R2 | 0.020 |
| Adjusted R2 | 0.012 |
| Residual Std. Error | 3.993 (df = 239) |
| F Statistic | 2.457* (df = 2; 239) |
| Note: | p<0.1; p<0.05; p<0.01 |
Several multicollinearity diagnostic measures exist. Although the focus will be on condition number here, it can be seen from the table below than any other standard test would deny the existence of substancial multicollinearity.
# Multicollinearity
mc <- omcdiag(zCrimes@data[colnames(zCrimes@data) %in% c("Average.Household.Size", "Median.Age")], log(zCrimes$crimes))
kable(mc$odiags, "html", booktabs = T, caption = "multicollinearity tests") %>%
kable_styling("striped")
| results | detection | |
|---|---|---|
| Determinant | 0.9881100 | 0 |
| Farrar Chi-Square | 2.8288247 | 0 |
| Red Indicator | 0.1090411 | 0 |
| sum of Lambda Invers | 2.0240660 | 0 |
| Theil Indicator | 0.0036336 | 0 |
| Condition Number | 11.8175111 | 0 |
A second assumption required for OLS is homoskedasticity of the errors. The Breusch-Pagan test below shows that the null hypothesis (constant variance) can not be rejected at the \(5\%\) confidence level.
# HETEROSKEDASTICITY
# ols_test_bartlett(as.data.frame(zCrimes@data$"Average.Household.Size"), as.data.frame(zCrimes@data$"Median.Age"))
bp <- ols_test_breusch_pagan(regHH)
m <- matrix(c(bp$bp, bp$p, 1))
rownames(m) <- c("Chi2", "Prob > Chi2", "DF")
kable(m, "html", booktabs = T, row.names = T, caption = "Breusch-Pagan test") %>% kable_styling("striped")
| Chi2 | 0.0650858 |
| Prob > Chi2 | 0.7986310 |
| DF | 1.0000000 |
Once classical OLS assumptions are verified, it remains to see if the spatial element plays a role. Spatial dependence is measured by Moran’s I test for residual spatial autocorrelation and Langrange Multiplier tests. All tests below are different ways of testing whether a weighted regression with spatial weights would imply significant spatial lag or spatial coefficients: for \(y = X \beta + \rho W y + u\), with \(u = \lambda W u + e\) and with \(W\) a spatial weights matrix; we test whether \(\rho = 0\) (spatial lag) or we test whether \(\lambda = 0\) (spatial error). Note that the SARMA tests for the existence of both at the same time. Given that all p-values are significant at the \(5\%\) confidence level, we can confidently reject the latter hypothesis and conclude that a spatial interaction is existant given a strong spatial autocorrelation of the residuals.
# Spatial Dependence
spatmatrix <- poly2nb(zCrimes)
w <- nb2listw(spatmatrix, zero.policy = T)
# Moran's I
moran <- lm.morantest(regHH, listw = w, zero.policy = T)# Moran’s I test for residual spatial autocorrelation
moranStat <- matrix(c(moran$statistic, moran$p.value, 1))
rownames(moranStat) <- c("statistic", "p.value", "parameter")
colnames(moranStat) <- "Moran's I (error)"
# Lagrange Multiplier test
LM <- lm.LMtests(regHH, listw = w, zero.policy = T, test = c("LMerr", "RLMerr", "LMlag", "RLMlag", "SARMA"))
LMrows <- c("statistic", "p.value", "parameter")
m <- matrix(NA, nrow = 3, ncol = length(LM))
rownames(m) <- LMrows
colnames(m) <- c("Lagrange Multiplier (error)", "Robust LM (error)", "Lagrange Multiplier (lag)", "Robust LM (lag)", "Lagrange Multiplier (SARMA)")
# colnames(m) <-
# everything in one table
for(i in 1:length(LM)){
r <- 1
for(j in LMrows){
m[r, i] <- LM[[i]][[j]]
r <- r + 1
}
}
m <- cbind(moranStat, m)
kable(m, "html", booktabs = T, row.names = T, caption = "spatial dependence tests") %>%
kable_styling("striped")
| Moran’s I (error) | Lagrange Multiplier (error) | Robust LM (error) | Lagrange Multiplier (lag) | Robust LM (lag) | Lagrange Multiplier (SARMA) | |
|---|---|---|---|---|---|---|
| statistic | 6.176374 | 36.22929 | 4.1895123 | 41.95159 | 9.9118129 | 46.1411 |
| p.value | 0.000000 | 0.00000 | 0.0406748 | 0.00000 | 0.0016422 | 0.0000 |
| parameter | 1.000000 | 1.00000 | 1.0000000 | 1.00000 | 1.0000000 | 2.0000 |
As a last model check, residuals are plotted below. Some clustering seems to occur in different parts of the city and especially along the coast.
# Residuals
#extract residuals
zCrimes@data$lmRes[!is.na(zCrimes$crimes)] <- resid(regHH) #residual lm
#view residuals for linear model
qpal <- colorQuantile("OrRd", zCrimes@data$lmRes, n=9)
leaflet(zCrimes) %>%
addPolygons(stroke = FALSE, fillOpacity = .8, smoothFactor = 0.2, color = ~qpal(lmRes)
) %>%
addTiles()
In order to deal with spatial autocorrelation, an OLS with spatial lag and an OLS with spatial errors are fitted. Model selection will be based on AIC.
Spatial lag can indicate a diffusion process: local events predict an increased likelihood of similar events in neighboring places.
#run a spatial lag model
lagModel <- lagsarlm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes, w,
zero.policy = T)
stargazer(lagModel, no.space = T, type = "html")
| Dependent variable: | |
| log(crimes) | |
| Average.Household.Size | 0.411 |
| (0.274) | |
| Median.Age | -0.0001 |
| (0.025) | |
| Constant | 1.692 |
| (1.350) | |
| Observations | 242 |
| Log Likelihood | -658.794 |
| sigma2 | 12.833 |
| Akaike Inf. Crit. | 1,327.587 |
| Wald Test | 46.066*** (df = 1) |
| LR Test | 36.270*** (df = 1) |
| Note: | p<0.1; p<0.05; p<0.01 |
#note that rho is listed below other model coefs.
No parameter is significantly different from zero in the lag model above, but it’s AIC (1327.59) is lower than for the original model (1361.86), indicating better fit.
Spatial error implies omitted covariates that are spatially correlated and that affect inference.
#run a spatial error model
errModel <- errorsarlm(log(crimes) ~ Average.Household.Size + Median.Age, data = zCrimes, w, zero.policy = T)
stargazer(errModel, no.space = T, type = "html")
| Dependent variable: | |
| log(crimes) | |
| Average.Household.Size | 0.233 |
| (0.271) | |
| Median.Age | -0.003 |
| (0.025) | |
| Constant | 4.936*** |
| (1.300) | |
| Observations | 242 |
| Log Likelihood | -660.183 |
| sigma2 | 12.986 |
| Akaike Inf. Crit. | 1,330.367 |
| Wald Test | 44.821*** (df = 1) |
| LR Test | 33.490*** (df = 1) |
| Note: | p<0.1; p<0.05; p<0.01 |
When running the spatial error model above, AIC (1330) goes up by a little, but the intercept becomes significantly different than zero. The intercept carries little inferential information, therefore this will not be considered as a gain compared to the lag model. The lag model will be preferred due to its lower AIC. Below, the residuals for the lag model and for the error model are plotted. They can be compared with the plot of the residuals of the original model further above.
#extract residuals for lag model
zCrimes@data$lagRes[!is.na(zCrimes$crimes)] <- resid(lagModel) #residuals lag
#view residuals for lag model
qpal<-colorQuantile("OrRd", zCrimes@data$lagRes, n=9)
title <- tags$div(tag.map.title, HTML("spatial lag model residuals"))
leaflet(zCrimes) %>%
addPolygons(stroke = FALSE, fillOpacity = .8, smoothFactor = 0.2, color = ~qpal(lagRes)
) %>%
addTiles() %>%
addControl(title, position = "topleft", className="map-title")
#extract residuals for err model
zCrimes@data$errRes[!is.na(zCrimes$crimes)] <- resid(errModel) #residual err
#compare with residuals for err model
qpal<-colorQuantile("OrRd", zCrimes@data$errRes, n=9)
title <- tags$div(tag.map.title, HTML("spatial error model residuals"))
leaflet(zCrimes) %>%
addPolygons(stroke = FALSE, fillOpacity = .8, smoothFactor = 0.2, color = ~qpal(errRes)
) %>%
addTiles() %>%
addControl(title, position = "topleft", className="map-title")
Another way of dealing with the intrinsic spatial interactions is to fit a geographically weighted regression model.
centr <- gCentroid(zCrimes, byid = TRUE) #byid tells function to add centroids per block group
# create SpatialPointsDataFrame to add all data from chi.poly@data to centr@data
centr <- SpatialPointsDataFrame(centr, data= zCrimes@data)
#Begin GWR prep: First, calculate bandwidth
#calculate optimal bandwidth for model (uses cross validation to find kernel bandwidth that
#generates best model...e.g.-kernel that minimizes RSS)
GWRbandwidth <- gwr.sel(log(crimes) ~ Average.Household.Size + Median.Age, data=centr, verbose = F,
coords=coordinates(centr),adapt=T)# coordinates() extracts x, y coordinates
#adapt tells function to calculate adaptive bandwidth
#run the gwr model
gwr.model <- gwr(log(crimes) ~ Average.Household.Size + Median.Age,
data=subset(centr, !is.na(crimes)), coords=coordinates, adapt=GWRbandwidth,
hatmatrix=TRUE, se.fit=TRUE)
#print the results of the model
#extract results for each variable
results<-as.data.frame(gwr.model$SDF)
#attach coefficients to original dataframe
centr$Median.AgeCoef[!is.na(zCrimes$crimes)] <- results$Median.Age
centr$Average.Household.SizeCoef[!is.na(zCrimes$crimes)] <- results$Average.Household.Size
#now plot the various GWR coefficients
#Does the median age influence violent crime uniformly across LA?
title <- tags$div(tag.map.title, HTML("median age influence on violent crime"))
qpal<-colorQuantile("OrRd", centr@data$Median.Age, n=9)
leaflet(centr) %>%
addCircleMarkers(radius=3,color = ~qpal(centr@data$Median.Age)
) %>%
addTiles() %>%
addControl(title, position = "topleft", className="map-title")
The plot above shows where median age influence violent crime across LA. It seems that patches of strongly influenced neighborhoods exist, implying that the influence is not uniform accross the city.
#Does the average household size influence violent crime uniformly across LA?
title <- tags$div(tag.map.title, HTML("average household size influence on violent crime"))
qpal<-colorQuantile("OrRd", centr@data$Average.Household.Size, n=9)
leaflet(centr) %>%
addCircleMarkers(radius=3,color = ~qpal(centr@data$Average.Household.Size)
) %>%
addTiles() %>%
addControl(title, position = "topleft", className="map-title")
The plot above shows where average household size influence violent crime across LA. Similarily, it seems that patches of strongly influenced neighborhoods exist. A clearer pattern is distinguishable: the outskirts of the city are more influenced by average household size. This might be due to the fact that poorer people live in the outskirts and that a bigger household has a more significant impact on the ability to foster and educate children.
The two hypotheses stated above can not be definitely confirmed by the statistical analysis. It seems that the heterogeneity in crime data (as can be seen in the plots above) is not well modelled by median age. Average household size seems to bear more inferential power: in the linear regression, a higher average household size did involve more crimes. More specifically, as implied by the geographically weighted regression, a higher average household size in the suburbs of LA implied more violent crimes. Note that in future studies other variables that are available in the crime and Zip datasets could be further investigated, in order to better model and predict crime rates.
https://ucr.fbi.gov/crime-in-the-u.s/2015/crime-in-the-u.s.-2015/offenses-known-to-law-enforcement/violent-crime↩
https://www.businessinsider.com/dea-maps-of-mexican-cartels-in-the-us-2016-12↩
https://catalog.data.gov/dataset/crime-data-from-2010-to-present↩
https://catalog.data.gov/dataset/crime-data-from-2010-to-present↩
https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2018&layergroup=ZIP+Code+Tabulation+Areas↩